Model correction of diagnostic coding-based RSV incidence for children 0–4 years in the US

Background: Although administrative claims data have a high degree of completeness, not all medically attended Respiratory Syncytial Virus-associated lower respiratory tract infections (RSV-LRTIs) are tested or coded for their causative agent. We sought to determine the attribution of RSV to LRTI in claims data via modeling of temporal changes in LRTI rates against surveillance data.
Methods: We estimated the weekly incidence of LRTI (inpatient, outpatient, and total) for children 0–4 years using 2011–2019 commercial insurance claims, stratified by HHS region. Weekly LRTI counts were matched to the corresponding weekly NREVSS RSV and influenza positivity data for each region and modelled against RSV and influenza positivity rates and harmonic functions of time, assuming a negative binomial distribution. LRTI events attributable to RSV were estimated as predicted events from the full model minus predicted events with the RSV positivity rate set to 0.
Results: Approximately 42% of predicted RSV cases were coded in claims data. Across all regions, the percentage of LRTI attributable to RSV was 15–43%, 10–31%, and 10–31% in inpatient, outpatient, and combined settings, respectively. However, when compared to coded inpatient RSV-LRTI, 9 of 10 regions had improbable corrected inpatient LRTI estimates (predicted RSV/coded RSV ratio < 1). Sensitivity analysis based on separate models for PCR and antigen-based positivity showed similar results.
Conclusions: Underestimation based on coding in claims data may be addressed by NREVSS-based adjustment of claims-based RSV incidence. However, where setting-specific positivity rates are unavailable, we recommend modeling across settings to mirror NREVSS’s positivity rates, which are similarly aggregated, to avoid inaccurate adjustments.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12879-024-09474-y.

Epidemiologic estimates of the burden of RSV infections as described above are typically derived from prospective cohorts, which may be limited in sample size and generalizability, or from automated healthcare data collected during routine clinical care of cases. Administrative claims data are an important source of health services utilization data repurposed for this type of epidemiologic research. Although claims data offer the advantage of having a high degree of completeness due to their use for reimbursement, not all medically attended LRTI are tested for RSV and assigned respective International Classification of Disease (ICD) codes that facilitate measurement of RSV-specific LRTI incidence rates. This could potentially underestimate RSV incidence in claims data. To correct for underestimation, LRTI cases that might be attributable to RSV can be modelled against data with RSV positivity to determine the proportion of cases due to RSV. Two previous studies modelled LRTI hospitalizations collected in the Healthcare Cost and Utilization Project (HCUP) data against RSV test positivity data collected by the National Respiratory and Enteric Virus Surveillance System (NREVSS) from the Centers for Disease Control and Prevention (CDC) [11,12]. Importantly, because NREVSS does not provide information on patient characteristics (age, disease type or severity) or the setting (inpatient/outpatient), the models had to be based on the assumption that overall RSV positivity results provided by NREVSS are representative of the RSV positivity in the study population, i.e., LRTI hospitalizations. However, we have demonstrated previously that the propensity to test for RSV as well as RSV positivity varies appreciably by patient age, disease type and severity, and setting [13]. Thus, use of aggregate RSV positivity information as available from NREVSS may yield inaccurate adjustments for RSV undertesting when analyzing claims data.
We aimed to estimate the proportion of LRTI episodes attributable to RSV and evaluate the performance of NREVSS-based RSV incidence correction models in claims data when applied separately to inpatient versus outpatient LRTI episodes.

Study design and population
We utilized a retrospective cohort study design to model 2011-2019 MarketScan® Commercial Claims and Medicare Supplemental data (MarketScan) and NREVSS virology testing data obtained from the U.S. Centers for Disease Control and Prevention (CDC) corresponding to the same time period. Participants were children in MarketScan 0 to 4 years of age who were continuously enrolled across an RSV year (July-June) in non-capitated health plans with prescription benefits. Enrollees in capitated health plans were excluded to avoid incomplete documentation of services received. Continuous enrollment was defined as a gap of less than 3 days between the end of one enrollment period and the next. The University of Florida Institutional Review Board exempted the study from review due to use of de-identified data.
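The continuous-enrollment rule can be illustrated with a minimal sketch. The function name `continuously_enrolled`, the representation of enrollment periods as inclusive `(first_day, last_day)` pairs, and the requirement that coverage begin on or before July 1 and run through June 30 are illustrative assumptions, not details of the study's actual data processing.

```python
from datetime import date

def continuously_enrolled(spans, year_start, year_end, max_gap_days=3):
    """Return True if enrollment spans cover [year_start, year_end] with
    every gap between consecutive spans shorter than max_gap_days.

    spans: iterable of (first_day, last_day) date pairs (hypothetical layout).
    """
    # Keep only spans that overlap the RSV year, ordered by start date.
    relevant = sorted(s for s in spans if s[1] >= year_start and s[0] <= year_end)
    if not relevant or relevant[0][0] > year_start:
        return False
    covered_until = relevant[0][1]
    for first_day, last_day in relevant[1:]:
        # Gap measured from the end of one period to the start of the next.
        if (first_day - covered_until).days >= max_gap_days:
            return False
        covered_until = max(covered_until, last_day)
    return covered_until >= year_end

# One RSV year (July-June) with a 2-day gap between two enrollment periods.
example_spans = [(date(2011, 7, 1), date(2011, 12, 31)),
                 (date(2012, 1, 2), date(2012, 6, 30))]
eligible = continuously_enrolled(example_spans, date(2011, 7, 1), date(2012, 6, 30))
```

How the gap is counted (here, the difference in days between the end of one period and the start of the next) is one possible reading of the definition above.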

MarketScan® Commercial Claims and Medicare Supplemental data (MarketScan)
MarketScan provides administrative claims data for a national sample of persons across the US with employer-sponsored health insurance, their spouses, and dependents. Data provided include information on beneficiary demographic characteristics, medical encounters with associated diagnoses and procedures, and dispensed prescription drugs. Over 130 million lives were covered between 2011 and 2019.

Virology data
Weekly data on RSV and influenza tests are collated by the CDC through the National Respiratory and Enteric Virus Surveillance System (NREVSS) from a sample of clinical and public health laboratories across the US. The RSV-related data provide the total number of RSV tests conducted each week and the number of positive RSV tests, stratified by test type (antigen, viral isolation/culture, or polymerase chain reaction (PCR)) and by US Department of Health and Human Services (HHS) region. Similarly, the influenza-related data include the total number of influenza tests conducted each week in each HHS region and the number of positive influenza tests for influenza A and B. Starting in 2015, NREVSS began to report influenza data separately for clinical and public health laboratories. We excluded 2015-2019 influenza data from public health laboratories because the data are restricted to positive specimens used mainly for surveillance to identify circulating strains. In both the RSV and influenza datasets, weekly data start on the first Saturday of each year.

Attribution of LRTI to RSV
We determined weekly LRTI incidence rates from MarketScan data in the inpatient and outpatient setting, stratified by age (0, 1, 2, 3 and 4 years) and HHS region. LRTI-related events were identified using ICD version 9 or 10 clinical modification codes, considering codes that do or do not identify a specific pathogen (eTable 1). LRTIs with any specific pathogen (RSV or not) were included because all could have been candidates for RSV testing and contribute to RSV positivity results in NREVSS. To ensure that infection was the reason for hospital admission, we included principal diagnoses of LRTI as well as secondary diagnoses of LRTI with a primary diagnosis of a medical condition that is directly related to the infection, such as respiratory distress (eTable 2). A unique LRTI episode was defined as a cluster of LRTI-related claims 30 or more days apart from the next adjacent claim. Inpatient episodes were identified before outpatient episodes, with ± 30 days around each inpatient episode excluded from the outpatient risk period. This hierarchical approach prioritizing inpatient episodes was to ensure that all severe cases were captured. Weekly LRTI incidence rates, following the NREVSS definition of a week, were estimated as the total weekly counts of unique LRTIs (events) divided by the total number of days at risk (population-time) in each week.
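The episode definition and the inpatient-first hierarchy described above can be sketched as follows. All function names are hypothetical, claims are reduced to bare service dates for a single child, and the scaling of the rate to 1 000 person-years (with 365.25 days per year) is an assumption made for illustration only.

```python
from datetime import date

def group_into_episodes(claim_dates, min_gap_days=30):
    """Cluster claim dates into episodes; a gap of min_gap_days or more
    from the previous claim starts a new episode."""
    episodes = []
    for d in sorted(claim_dates):
        if episodes and (d - episodes[-1][-1]).days < min_gap_days:
            episodes[-1].append(d)
        else:
            episodes.append([d])
    return episodes

def drop_outpatient_near_inpatient(outpatient_dates, inpatient_episodes, window_days=30):
    """Keep outpatient claim dates falling outside +/- window_days of every
    inpatient episode (inpatient episodes take priority)."""
    kept = []
    for d in outpatient_dates:
        near = any((ep[0] - d).days <= window_days and (d - ep[-1]).days <= window_days
                   for ep in inpatient_episodes)
        if not near:
            kept.append(d)
    return kept

def weekly_incidence_rate(events, days_at_risk, per_person_years=1000):
    """Events divided by population-time, expressed per 1 000 person-years."""
    return events / days_at_risk * 365.25 * per_person_years

claims = [date(2015, 1, 1), date(2015, 1, 10), date(2015, 2, 5), date(2015, 3, 20)]
episodes = group_into_episodes(claims)
remaining = drop_outpatient_near_inpatient(
    [date(2015, 1, 15), date(2015, 4, 1)], [[date(2015, 2, 1)]])
```

In this toy input, the first three claims fall within 30 days of one another and form one episode, while the March claim starts a second one.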

Preparation of analytical dataset
We used corresponding weekly NREVSS data for RSV and influenza to determine the proportion of LRTI events attributable to RSV. RSV and influenza positivity rates were determined as the proportion of the total number of RSV and influenza tests, respectively, conducted that week that were positive. We then modeled the weekly number of LRTI events against RSV and influenza positivity rates (NREVSS data) using negative binomial models with days at risk as the log offset variable as shown in Eq. 1.
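One form of Eq. 1 that is consistent with the term definitions given for it can be written as below. The fourth-order polynomial in t, the 52-week period of the harmonic terms, and the labels RSV(i), FluA(i), and FluB(i) for the three positivity covariates are assumptions inferred from the coefficient numbering (β7 through β9 for positivity leaves β1 through β6 for trend and seasonality) and from the influenza A and B data described earlier; they are not confirmed specifications of the fitted model.

$$
\log \mathrm{E}[Y(i)] = \alpha + \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \beta_4 t^4 + \beta_5 \sin\left(\frac{2\pi t}{52}\right) + \beta_6 \cos\left(\frac{2\pi t}{52}\right) + \beta_7\,\mathrm{RSV}(i) + \beta_8\,\mathrm{FluA}(i) + \beta_9\,\mathrm{FluB}(i)
$$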
where Y(i) represents the number of LRTI events during a given week i, α is the offset term and is equal to the log of the population size in each age group and region, t is a running index of the weeks between July 2011 and June 2019 where 1 is assigned to week 1, 2 to week 2 and so on, sin and cos are harmonic functions of t accounting for seasonal events, and β7 through β9 represent coefficients associated with the proportion of standardized specimens testing positive during a given week in the NREVSS data [12,14]. Age was not an input parameter for the model as separate predictions were made for each age stratum. Because the majority of RSV positivity results are based on either antigen or PCR results [13], we considered positivity results from both test types in our model. The RSV variable was a combination of antigen and PCR positivity rates weighted by the weekly distribution of tests conducted. For example, if the total number of antigen and PCR tests conducted in a week were 200 and 300 respectively, the positivity rates for the corresponding tests in NREVSS data for that week were weighted by 0.4 and 0.6 respectively and then summed to obtain the weighted RSV positivity rate. This assumed that the distribution of tests by test type was similar in both NREVSS and the MarketScan cohort from which data were collected.
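The weighting in the example above can be sketched in a few lines. The function name `weighted_positivity` and the positive-test counts (20 antigen, 60 PCR) are hypothetical; only the test counts of 200 and 300 come from the example in the text.

```python
def weighted_positivity(tests_by_type, positives_by_type):
    """Combine per-test-type positivity rates, weighting each rate by that
    test type's share of all tests conducted in the week."""
    total_tests = sum(tests_by_type.values())
    if total_tests == 0:
        return 0.0  # no tests reported that week (handling is an assumption)
    combined = 0.0
    for test_type, n_tests in tests_by_type.items():
        if n_tests == 0:
            continue
        weight = n_tests / total_tests                  # e.g. 200/500 = 0.4
        rate = positives_by_type[test_type] / n_tests   # type-specific positivity
        combined += weight * rate
    return combined

tests = {"antigen": 200, "pcr": 300}       # counts from the example in the text
positives = {"antigen": 20, "pcr": 60}     # hypothetical positive counts
rsv_positivity = weighted_positivity(tests, positives)
```

Because each weight is the test type's share of all tests, the weighted sum is algebraically the same as pooling the two test types (total positives divided by total tests).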

Analysis
To estimate the number of LRTI events attributable to RSV, we adapted the approach by Zhou et al. [11]. We estimated the predicted number of events with all parameters defined in the full model and with RSV positivity rate set to 0 (reduced model). The predicted number of events in the reduced model was then subtracted from the predicted number of events in the full model to obtain the number of LRTI events attributable to RSV. We used the 95% confidence interval for the RSV coefficient in the reduced model to calculate 95% upper and lower confidence limits for the number of RSV cases attributed to LRTI. We deviated from Zhou et al.'s approach by not excluding LRTI cases with diagnostic codes specifying RSV or influenza from the model. In their approach, only LRTIs without RSV or influenza designation were modeled and the estimated excess RSV cases were then added to the excluded RSV cases in the claims data. This approach could potentially mis-specify the model and result in overestimation of RSV incidence rates, because NREVSS positivity data are a reflection of tests ordered for all LRTI cases (i.e., with or without confirmed RSV). Separate models were run by HHS region and then aggregated to obtain the total number of LRTI events attributed to RSV. Strata with negative estimates were substituted with 0.
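The subtraction of reduced-model from full-model predictions can be illustrated with a minimal sketch for a single week and stratum, assuming a log-link model whose coefficients have already been fitted. The function names, covariate labels, and coefficient values below are hypothetical and chosen only to make the arithmetic visible.

```python
import math

def predicted_events(coef, covariates, log_offset):
    """Expected weekly events under a log-link count model:
    exp(offset + intercept + sum of coefficient * covariate)."""
    linear = log_offset + coef["intercept"]
    linear += sum(coef[name] * value for name, value in covariates.items())
    return math.exp(linear)

def rsv_attributable_events(coef, covariates, log_offset):
    """Full-model prediction minus the prediction with RSV positivity set
    to 0; negative estimates are replaced by 0."""
    full = predicted_events(coef, covariates, log_offset)
    without_rsv = predicted_events(coef, {**covariates, "rsv": 0.0}, log_offset)
    return max(full - without_rsv, 0.0)

# Hypothetical coefficients and one week of covariate values.
coef = {"intercept": -9.0, "rsv": 2.0, "flu": 1.0}
week = {"rsv": 0.2, "flu": 0.1}
offset = math.log(700_000)  # log of days at risk in the stratum (made up)
attributable = rsv_attributable_events(coef, week, offset)
```

Under the same assumptions, approximate confidence limits could be obtained by repeating the calculation with the RSV coefficient replaced by the lower and upper bounds of its 95% confidence interval.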
RSV incidence was estimated as the number of predicted RSV cases attributed to LRTI divided by the total number of person-years at risk for LRTI. We conducted sensitivity analyses considering RSV test types. Specifically, to check the robustness of our modeling approach and assumptions, separate models estimating the attribution of RSV to LRTI were run for antigen and PCR tests. This had the same assumption as the initial model but had the potential to overestimate RSV-attributable LRTI episodes since the predictor was all LRTI episodes identified in MarketScan data. Hence, the weekly attribution from each model was weighted by the weekly distribution of tests before aggregation.

Results
Across July 2011 to June 2019, there were 54 872 and 1 432 300 unique episodes coded for LRTI in the inpatient and outpatient files in MarketScan, respectively, for a total of 1 487 172. Episodes coded as RSV made up 42.1% of the 351 258 model-predicted RSV-attributable LRTI episodes (Table 1).
Table 2 shows estimated RSV LRTI incidence rates based on the modelling of LRTI episodes from MarketScan data against RSV and influenza testing and positivity from NREVSS data. The overall inpatient RSV LRTI incidence was 1.74 (1.11-2.27) cases per 1 000 person-years, ranging between 0.85 (0.18-1.43, San Francisco) and 3.04 (1.94-3.96, Denver) cases per 1 000 person-years. The outpatient incidence was 38.55 (29.91-46.56), ranging between 14.75 (4.45-24.23, San Francisco) and 60.85 (50.46-70.46, Dallas).
Figure 1 shows the ratio of the number of model-estimated RSV LRTI episodes (LRTI episodes attributed to RSV) to the number of LRTI cases coded for RSV in MarketScan data (LRTI episodes with an RSV diagnostic code). Nationally, this ratio was 0.73, 2.69, and 2.37 for inpatient, outpatient, and combined cases, respectively, i.e., the model predicted fewer inpatient RSV-LRTI episodes than were coded as RSV-associated. By region, the ratio ranged from 0.38 to 1.07 for inpatient cases, 1.74-4.51 for outpatient cases, and 1.46-3.56 for both settings combined. With the exception of Atlanta, the ratio of predicted RSV to coded RSV cases was less than 1 in the inpatient setting, while all regions had ratios greater than 1 for the outpatient setting and both settings combined. Sensitivity analyses based on separate models for the RSV test types showed similar estimates, with ratios less than 1 for inpatient and greater than 1 for outpatient and combined settings.
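As a consistency check on the reported figures, the share of model-predicted episodes that were coded as RSV is the reciprocal of the predicted-to-coded ratio for the combined settings. This short calculation uses only the 42.1% and 2.37 values reported in this section:

$$
\frac{\text{predicted RSV-LRTI}}{\text{coded RSV-LRTI}} = \frac{1}{0.421} \approx 2.38
$$

which agrees with the reported national combined ratio of 2.37 to within rounding of the two published values.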

Discussion
In this study, we applied modeling techniques to correct for underestimation of RSV-LRTI incidence rates common with administrative claims data. Taking advantage of seasonal variation of RSV and influenza infections, the model estimates temporal changes in LRTI incidences as a function of temporal changes in RSV and influenza test positivity data in NREVSS. The degree to which temporal changes in LRTI rates are explained by variations in positivity allows for the estimation of RSV attribution to LRTIs.
Our findings are similar to those of a prospective study in 2009 that included 5 067 children < 5 years of age with acute respiratory illness (ARI). The authors found that 20% of inpatient admissions, 18% of emergency visits and 15% of outpatient visits for ARI were attributable to RSV [9]. Although these numbers were lower than our model estimates (29% for inpatients and 24% for outpatients), both studies indicate that a larger proportion of inpatient visits than of outpatient visits was attributable to RSV. The RSV incidence per 1 000 children in this prospective study was 2.1-3.7 for hospitalizations and 61-80 for outpatient visits, compared to our 1.7 and 38.5, respectively.
Our inpatient incidence of 173.67 per 100 000 child-years is within the 47-360 range obtained for children 0-4 years using a modeling approach similar to ours applied to HCUP data [15]. Our estimate is, however, lower than the 514 per 100 000 child-years obtained in a different modelling study, which included a broad selection of medical conditions ranging from diseases and symptoms of the respiratory system to viral infections of unspecified site in National Inpatient Sample data [12], unlike our study, which modeled only LRTIs. This difference in modeled disease between the two studies may be more significant than other factors such as time period, cohort selection and data source, or modeling parameters.
Novel to our work in comparison to previous reports is the contrast of modeled RSV LRTI incidence estimates to those directly obtained from RSV-coded encounters in the claims data. The benchmark study by Zhou et al. reported that the model-based RSV hospitalization rate was about 74-76% that of ICD code-based rates for children under 5 years, i.e., a predicted to coded RSV ratio of 0.74-0.76 [11]. While this finding is plausible, we argue that the methodological approach used by the authors tends to overestimate RSV incidence. The authors excluded LRTI cases with RSV codes from their model and added the newly predicted cases from the model to those coded cases. In contrast, we considered all LRTI cases regardless of RSV coding to reflect the entirety of cases that would have contributed to the NREVSS data, expecting that the model corrects for RSV undercoding. Our approach resulted in model-predicted to coded RSV ratios greater than 1 for outpatient LRTIs but underestimated the contribution of RSV to inpatient LRTIs when compared to the coded data in MarketScan. In other words, correction for non-coding of RSV-associated inpatient LRTI events resulted in improbable estimates in 9 of the 10 HHS regions (Fig. 1). This finding is contrary to expectations since coding of RSV-LRTI in claims data is expected to undercount true RSV-attributable LRTI episodes, given that not all LRTI cases are tested to confirm the causative pathogen. Sensitivity analysis based on separate models for PCR and antigen-based positivity showed similar results.
A potential explanation for the unexpected finding is the lack of setting-specific positivity rates in NREVSS data. Given that the models for both inpatient and outpatient RSV-attributable LRTI are based on total RSV tests, it is conceivable that modeling inpatient LRTI rates from MarketScan data against an RSV positivity rate based on total RSV tests from NREVSS data would result in inaccurate predictions. The same mismatch would minimally impact outpatient estimates, since the majority of identified LRTI cases occur in the outpatient setting (1 432 300 vs. 54 872). This is significant as it highlights an important limitation of using NREVSS data to estimate RSV-LRTI [11]. We, therefore, recommend that in- and outpatient LRTIs are only modeled combined to align with RSV data from NREVSS and avoid inaccurate adjustments. Ideally, positivity rates used for modeling would be available for the target population that provides the cases for modeling, e.g., for the study at hand, positivity rates specific to children under age 5 years with LRTI stratified by setting would be used.
One of the strengths of our study is that it includes both inpatient and outpatient models, unlike previous studies that only included an inpatient model, and that it allows a contrast between RSV attribution based on modeling versus ICD coding. It also addresses methodological limitations of those studies, such as the inclusion of all age groups and the assumed model distribution. It should be noted that even our combined estimates (across in- and outpatient episodes) or outpatient-specific estimates may be slightly inaccurate because NREVSS data are not specific to age or indication, even though most tests are expected to represent children of younger age.
A limitation of our study is that MarketScan data capture medically attended pediatric LRTIs only in children who are privately insured, resulting in underestimates of the overall disease burden in the US since RSV infections are less common among the commercially insured compared to the publicly insured population [16,17]. Furthermore, a single LRTI episode could be associated with multiple encounters in both inpatient and outpatient files, which could potentially lead to double counting LRTI events. We minimized this limitation by applying an algorithm to identify unique LRTI inpatient and outpatient episodes requiring at least 30 days between individual episodes and applying a hierarchical algorithm that prioritized inpatient events. We were unable to model RSV-LRTI incidence by age because NREVSS data do not provide RSV positivity rates by age. Although most RSV cases occur in children under 5 years, the incidence of RSV is markedly different for each year of age from 0 to 4 years. Our model of LRTI incidence against RSV positivity aggregated across age takes this limitation into account. As discussed in our recent work, age-specific modeling by test type, test indication, clinical setting, and region should be a preferred method to better estimate RSV attribution to LRTIs [13].

Conclusions
Underestimation based on coding in claims data may be addressed by NREVSS-based adjustment of claims-based RSV incidence. However, correction for under-coding of RSV-associated inpatient LRTI episodes resulted in improbable estimates of inpatient RSV LRTI incidence rates in most HHS regions. Where setting-specific positivity rates are unavailable, we recommend modelling claims data aggregated across settings, in order to mirror NREVSS's positivity rates, which are similarly aggregated. This approach would avoid inaccurate adjustments.

Fig. 1
Ratio of model-estimated number of RSV-LRTI episodes to number of RSV-coded LRTI episodes for children 0-4 years in the US

Table 1
Number and percentages of LRTI episodes coded as RSV and attributed to RSV by model estimation for children 0-4 years in the US

Table 2
Model-estimated annual RSV incidence for children 0-4 years in the US. Abbreviations: LRTI = Lower Respiratory Tract Infection; PY = Person-Years; RSV = Respiratory Syncytial Virus